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ABSTRACT 

Bacterial tyrosine-kinases share no resemblance 
with their eukaryotic counterparts and they have 
been unified in a new protein family named 
BY-kinases. These enzymes have been shown to 
control several biological functions in the bacterial 
cells. In recent years biochemical studies, sequence 
analyses and structure resolutions allowed the 
deciphering of a common signature. However, 
BY-kinase sequence annotations in primary data- 
bases remain incomplete. This prompted us to 
develop a specialized database of computer- 
annotated BY-kinase sequences: the Bacterial 
protein tyrosine-kinase database (BYKdb). 
BY-kinase sequences are first identified, thanks to 
a workflow developed in a previous work. A second 
workflow annotates the UniProtKB entries in order 
to provide the BYKdb entries. The database can be 
accessed through a web interface that allows static 
and dynamic queries and offers integrated 
sequence analysis tools. BYKdb can be found at 
http://bykdb.ibcp.fr. 



INTRODUCTION 

Reversible protein phosphorylation is a major mechanism 
in the regulation of fundamental signaling events in all 
living organisms. In bacteria, four types of phosphoryl- 
ation systems were described: (i) the two-component 
system (1), (ii) the phosphoenolpyruvate (PEP) transferase 
system (PTS) (2), (iii) the eukaryotic-like system (3) and 
(iv) the bacterial tyrosine kinase (BY-kinase) system (4). 
Those four types differ in the nature of the 



phosphorylated amino acids (S, T, Y, H and D) and the 
phosphate donor (adenosine tri-phosphate (ATP) or 
PEP). The most recently discovered system involves the 
BY-kinase family. 

BY-kinases comprise two domains: a two-pass trans- 
membrane activator domain (TAD) that includes a large 
extra cellular part, and an intracellular catalytic domain 
(CD). The BY-kinases CD encompasses three Walker-like 
motifs (5), called A, A' and B, and a C-terminal 
tyrosine-rich region named Y cluster (YC). 

These two domains can either be linked in a single poly- 
peptide (in proteobacteria and actinobacteria), or split 
into two different proteins, encoded by two adjacent 
genes (in firmicutes). In this case, it has been shown that 
BY-kinases activity was effective only if the CD was inter- 
acting with the region following the second transmem- 
brane segment of TAD. 

A number of studies have demonstrated the importance 
of BY-kinases in several facets of the physiology of the 
bacterial cell. More precisely, their best characterized 
function concerns the regulation of the export and the 
biosynthesis of bacterial extracellular polysaccharide that 
are recognized important virulence factors. Since these 
enzymes are not homologs of their eukaryotic counter- 
parts, BY-kinases are particularly interesting in the 
search of new therapeutic targets to combat bacterial 
pathogens. 

We present here the Bacterial tyrosine kinase database 
(BYKdb), developed in order to collect, store and manage 
BY-kinase sequences with standardized annotations. The 
sequences are identified by using an in silico workflow 
described in a previous work (6). Diverse sequences 
identified by the workflow have been validated experimen- 
tally since then (Dr Mijakovic, personal communication). 
The sequences are automatically annotated and stored in 
the database by a second automated workflow. The 
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database can be queried via a WWW interface and the 
query results can be further analyzed with the numerous 
integrated analysis tools. 

The BYKdb is updated on a monthly basis from the 
UniProt Knowledgebase (UniProtKB) (7), and stored in 
the PostgreSQL relational database management system 
(RDBMS). The programs for the sequence identification 
and annotation, as well as for the querying and the man- 
agement of the database are implemented in Java and 
SQL programming languages. 



The first step of the database building procedure is the 
identification of BY-kinase sequences, thanks to the 
'FindBYK' process. It relies on the HMMER software 
package (version 3.0 of March 2010) (8). The BY- 
kinases TAD is matched by the Pfam profile PF02706 
(9), in a region including the first transmembrane 
segment. Since this profile identifies proteins that are 
not BY-kinases, and does not identify CD of two-parts 
BY-kinases, it is not sufficient to unambiguously identify 
BY-kinases. Thus, a second profile (6) is used to identify 
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Figure 1. Snapshot of the BYKdb web interface. (A) The BYKdb home page, with the resources accessible as clickable buttons (B) A pre-computed 
dataset of BY-kinase and TAD sequences and alignments, ordered by phylum and by cluster. (C) The multiple sequence alignment of firmicutes CD 
viewed in the 'EditAlignment' applet. (D) Dynamic queries against the BYKdb. (E) The 'Annotate' tool detailed results page for one input sequence. 
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the CD. The two 'hmmsearch' result sets are analyzed to 
lead to three clusters of sequences. The first cluster is 
constituted by sequences matching both HMM profiles 
('TADCD' cluster). The second cluster contains sequences 
with the TAD and CD encoded by two proteins ('TAD' 
and 'CD' cluster). The sequences are grouped according to 
their successive accession numbers (AC). The sequences of 
the third cluster only match the CD profile and have no 
identified TAD ('orphanCD' cluster). The sequences are 
further divided into two categories ('confirmed' and 
'unsure') according to a filter (isBYK). This filter checks 
if a sequence contains all the necessary CD motifs 
(Walker-like motifs and tyrosine cluster) with the right 
spacing between them (6). The 'unsure' sequences are 
not integrated in the database. 

The 'AnnotateBYK' process ensures the second step of 
the procedure, i.e. the sequence annotation. The annota- 
tion starts with the UniProtKB entries extracted according 
to the AC of the relevant sequences identified in the first 
step. The entries are cleared from some data and new data 
are added. Indeed, the accession numbers, the creation 
date, the organism and the sequence of each entry are 
conserved. In the description line (DE), the recommended 
name field is set to 'Bacterial tyrosine kinase' or to 
'Adaptator of bacterial tyrosine kinase' depending on 
the type of sequences. In the keywords list (KW) are 
added the 'byk' abbreviation and the cluster of sequences: 
'cd', 'tad' or 'tadcd'. The domains and motifs with their 
start and end positions in the sequence are also added to 
the feature (FT) lines, respectively as domain and site 
fields. Besides, for each sequences of the second group, a 
cross-reference to the related TAD entry is added and re- 
ciprocally for the TAD entry. The BY-kinase entries are 
then loaded in the relational database. 



WEB INTERFACE 

The BYKdb is accessible through a website (http://bykdb 
.ibcp.fr) (Figure 1). The user can navigate through the site, 
thanks to the menubar or the home page buttons. The site 
is divided into static and dynamic parts. 

In the static part, the user can find general information 
about BY-kinases with links to other external resources 
('BYK' link). The user can also access pre-computed 
datasets ('Dataset' link) sorted according to the phylum 
and BY-kinase cluster. The datasets include BY-kinase 
sequences in Pearson/Fasta format and the corresponding 
multiple sequence alignment. The user can download 
the corresponding files for further analysis. Furthermore, 
the alignments can be viewed and edited with the 
'EditAlignment' applet developed by our team. In the dy- 
namic part ('Dynamic' link), the user can extract his own 
datasets by combining multiple criteria (e.g. species and 
sequence length and sequence cluster). The datasets can be 
exported as Fasta/Pearson sequences, accession number 
lists and entry fiat files for further analysis with the 
integrated analysis tools. 

The available analysis tools are either generic or 
specialized. The generic analysis tools (e.g. BLAST or 
Clustal W) are available through the NPS@ server (10), 



which is an integrated sequence analysis web server. The 
specialized analysis tool ('Annotate' link) available on the 
site allows the annotation of one or several sequences, 
which can be useful to scan protein sequences from a 
newly sequenced bacterial genome in order to identify 
BY-kinases and their adaptators. In the results page, the 
user can access the start/end positions and the sequences 
of the different domains and sites. Furthermore, the 
domains and sites are mapped in color on the BY-kinase 
sequence. 

STATISTICS 

BYKdb has been available since July 201 1. The release 3.0 
(October 2011) comprises 2,746 sequences, including 713 
CD sequences with their corresponding TAD sequences, 
1,169 TADCD sequences and 151 orphanCD sequences. 

CONCLUSION AND PERSPECTIVE 

BYKdb collects BY-kinase sequences, identified and 
annotated, thanks to a computer-automated system. The 
automatic annotation process used to generate BYKdb 
entries guarantees harmonized annotations and allows ef- 
ficient keyword searches. The BYKdb website gives access 
to pre-computed static datasets and allows dynamic 
queries. The extracted data can be further analyzed with 
generic or specialized bioinformatics algorithms. In the 
future, the annotations will be enriched with new data 
(e.g. information about protein structure) and new 
analysis tools (e.g. phylogeny) will be added. 
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